Abstract

In order for an agent to perform well in par- 
tially observable domains, it is usually nec- 
essary for actions to depend on the history 
of observations. In this paper, we explore 
a stigmergic approach, in which the agent's 
actions include the ability to set and clear 
bits in an external memory, and the exter- 
nal memory is included as part of the input 
to the agent. In this case, we need to learn a 
reactive policy in a highly non-Markovian domain. We explore two algorithms: SARSA(λ),
which has had empirical success in partially 
observable domains, and VAPS, a new algo- 
rithm due to Baird and Moore, with conver- 
gence guarantees in partially observable do- 
mains. We compare the performance of these 
two algorithms on benchmark problems. 



1 Introduction 

A reinforcement-learning agent must learn a mapping from a stream of observations of the world to a stream of actions. In completely observable domains, it is sufficient to look only at the last observation, so the agent can learn a "memoryless" mapping from observations to actions [16]. In general, however, the agent's actions may have to depend on the history of previous observations.

Previous Work There have been many approaches 
to learning to behave in partially observable domains. 
They fall roughly into three classes: optimal memory- 
less, finite memory, and model-based. 



The first strategy is to search for the best possible memoryless policy. In many partially observable domains, memoryless policies can actually perform fairly well. Basic reinforcement-learning techniques, such as Q-learning [25], often perform poorly in partially observable domains, due to a very strong Markov assumption. Littman showed that finding the optimal memoryless policy is NP-Hard. However, Loch and Singh effectively demonstrated that techniques, such as SARSA(λ), that are more oriented toward optimizing total reward, rather than Bellman residual, often perform very well. In addition, Jaakkola, Jordan, and Singh have developed an algorithm for finding stochastic memoryless policies, which can perform significantly better than deterministic ones [20].

One class of finite memory methods is the finite-horizon memory methods, which can choose actions based on a finite window of previous observations. For many problems this can be quite effective. More generally, we may use a finite-size memory, which can possibly be infinite-horizon (the system remembers only a finite number of events, but these events can be arbitrarily far in the past). Wiering and Schmidhuber proposed such an approach, learning a policy that is a finite sequence of memoryless policies.

Another class of approaches assumes complete knowledge of the underlying process, modeled as a partially observable Markov decision process (POMDP). Given a model, it is possible to attempt optimal solution, or to search for approximations in a variety of ways. These methods can, in principle, be coupled with techniques, such as variations of the Baum-Welch algorithm, for learning the model to yield model-based reinforcement-learning systems.



Stigmergy In this paper, we pursue an approach based on stigmergy. The term is defined in the Oxford English Dictionary as "The process by which the results of an insect's activity act as a stimulus to further activity," and is used in the mobile robotics literature to describe activity in which an agent's changes to the world affect its future behavior, usually in a useful way.

Figure 1: The architecture of a stigmergic policy.

One form of stigmergy is the use of external memory 
devices. We are all familiar with practices such as 
making grocery lists, tying a string around a finger, 
or putting a book by the door at home so you will 
remember to take it to work. In each case, an agent 
needs to remember something about the past and does 
so by modifying its external perceptions in such a way 
that a memoryless policy will perform well. 

We can apply this approach to the general problem 
of learning to behave in partially observable environments. Figure 1 shows the architectural idea. We
think of the agent as having two components: one is 
a set of memory bits; the other is a reinforcement- 
learning agent. The reinforcement-learning agent has 
as input the observation that comes from the environ- 
ment, augmented by the memory bits. Its output con- 
sists of the original actions in the environment, aug- 
mented by actions that change the state of the mem- 
ory. If there are sufficient memory bits, then the opti- 
mal memoryless policy for the internal agent will cause 
the entire agent to behave optimally in its partially ob- 
servable domain. 

Consider, for instance, the load-unload problem represented in Figure 2. In this problem, the agent is a cart that must drive from an Unload location to a Load location, and then back to Unload. This problem is a simple MDP with a one-bit hidden variable that makes it non-Markov (the agent cannot see whether it is loaded or not). It can be solved using a one-bit external memory: we set the bit when we make the Unload observation, and we go right as long as it is set to this value and we do not make the Load observation. When we do make the Load observation, we clear the bit and we go left as long as it stays at value 0, until we reach state 9, getting a reward.

Figure 2: The state-transition diagram of the load-unload problem; aliased states are grouped by dashed boxes.
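The one-bit solution can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: the corridor layout, the class name `LoadUnload`, the observation labels, and the choice that setting or clearing the bit consumes a time step are all hypothetical and are not taken from the experiments reported later.

```python
from dataclasses import dataclass


@dataclass
class LoadUnload:
    """Hypothetical corridor: position 0 is Unload, position n-1 is Load."""
    n: int = 5
    pos: int = 0
    loaded: bool = False  # hidden variable, never shown to the agent

    def observe(self):
        if self.pos == 0:
            return "unload"
        if self.pos == self.n - 1:
            return "load"
        return "corridor"  # all intermediate positions look the same

    def step(self, move):
        delta = 1 if move == "right" else -1
        self.pos = min(self.n - 1, max(0, self.pos + delta))
        if self.pos == self.n - 1:
            self.loaded = True
        done = self.pos == 0 and self.loaded
        return (1.0 if done else 0.0), done


# Reactive (memoryless) policy over the augmented input (observation, bit).
POLICY = {
    ("unload", 0): "set",      # remember that we have left Unload
    ("unload", 1): "right",
    ("corridor", 1): "right",  # bit set: still heading to Load
    ("load", 1): "clear",      # remember that we are now loaded
    ("load", 0): "left",
    ("corridor", 0): "left",   # bit clear: heading back to Unload
}


def run_trial(n=5, max_steps=100):
    """Run one trial; memory actions cost one step each (assumption)."""
    env, bit = LoadUnload(n=n), 0
    for t in range(1, max_steps + 1):
        action = POLICY[(env.observe(), bit)]
        if action == "set":
            bit = 1
        elif action == "clear":
            bit = 0
        else:
            reward, done = env.step(action)
            if done:
                return t, reward
    return max_steps, 0.0
```

In this hypothetical layout the policy needs two memory steps in addition to the 2(n-1) moves, which is the extra cost that the choice between the two architectures below is about.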

There are two alternatives for designing an architec- 
ture with external memory: 

• Either we augment the action space with actions 
that change the content of one of the memory bits 
(adds L new actions if there are L memory bits); 
changing the state of the memory may require 
multiple steps. 

• Or we compose the action space with the set of all 
possible values for the memory (the size of the action space is then multiplied by 2^L, if there are L
bits of memory). In this case, changing the exter- 
nal memory is an instantaneous action that can be 
done at each time step in parallel with a primitive 
action, and hence we can reproduce the optimal 
policy of the load-unload problem, without taking 
additional steps. 

Complexity considerations usually lead us to take the 
first option. It introduces a bias, since we have to lose 
at least one time-step each time we want to change 
the content of the memory. However, it can be fixed 
in most algorithms by not discounting memory-setting 
actions. 
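The difference in size between the two designs can be seen by enumerating both action spaces. This is a minimal sketch; the function names and the use of a per-bit "toggle" action (one new action per bit, matching the count of L new actions) are assumptions made for illustration.

```python
from itertools import product


def augmented_actions(primitive, n_bits):
    """First option: add one memory-changing action per bit (|A| + L actions)."""
    return list(primitive) + [("toggle", i) for i in range(n_bits)]


def composed_actions(primitive, n_bits):
    """Second option: pair each primitive action with a full memory
    setting (|A| * 2^L actions)."""
    return [(a, bits)
            for a in primitive
            for bits in product((0, 1), repeat=n_bits)]
```

With two primitive actions and three memory bits, the first design has 5 actions and the second has 16, so the gap widens quickly as bits are added.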

The external-memory architecture has been pursued in the context of classifier systems and in the context of reinforcement learning by Littman and by Martin. Littman's work was model-based; it assumed that the model was completely known and did a branch-and-bound search in policy space. Martin worked in the model-free reinforcement-learning domain; his algorithms were very successful at finding good policies for very complex domains, including some simulated visual search and block-stacking tasks. However, he made a number of strong assumptions and restrictions: task domains are strictly goal-oriented; it is assumed that there is a deterministic policy that achieves the goal within some specified number of steps from every initial state; and there is no desire for optimality in path length.

This work We were inspired by the success of Martin's algorithm on a set of difficult problems, but concerned about its restrictions and a number of details of the algorithm that seemed relatively ad hoc. At the same time, Baird and Moore's work on VAPS, a general method for gradient descent in reinforcement learning, appealed to us on theoretical grounds. This paper is the result of attempting to apply VAPS algorithms to stigmergic policies, and understanding how they relate to Martin's algorithm. In this process, we have derived a much simpler version of VAPS for the case of highly non-Markovian domains: we calculate the same gradient as VAPS, but with much less computational effort.

In the next section, we present the relevant learning 
algorithms. Then we describe a set of experimental 
domains and discuss the relative performance of the 
algorithms. 

2 Algorithms 

We begin by describing the most familiar of the algorithms, SARSA(λ). We then describe the VAPS algorithm in some detail, followed by our simplified version.

2.1 SARSA(λ)

SARSA is an on-policy temporal-difference control learning algorithm. Given an experience in the world, characterized by starting state $x$, action $a$, reward $r$, resulting state $x'$ and next action $a'$, the update rule for SARSA(0) is

$$Q(x, a) \leftarrow Q(x, a) + \alpha \left[ r + \gamma Q(x', a') - Q(x, a) \right] . \tag{1}$$

It differs from the classical Q-learning algorithm in that, rather than using the maximum Q-value from the resulting state as an estimate of that state's value, it uses the Q-value of the resulting state and the action that was actually chosen in that state. Thus, the values learned are sensitive to the policy being executed.
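Update rule (1) translates almost directly into code. The following is a minimal tabular sketch; the dictionary-based table and the function name `sarsa0_update` are assumptions for illustration.

```python
from collections import defaultdict


def sarsa0_update(Q, x, a, r, x_next, a_next, alpha, gamma):
    """Apply one SARSA(0) update to the table Q in place.

    The target uses Q(x', a') for the action actually chosen next,
    not the maximum over actions as Q-learning would.
    """
    td_error = r + gamma * Q[(x_next, a_next)] - Q[(x, a)]
    Q[(x, a)] += alpha * td_error
    return td_error


# Usage: one experience (x="A", a="left", r=0, x'="B", a'="right").
Q = defaultdict(float)
Q[("B", "right")] = 1.0
sarsa0_update(Q, "A", "left", 0.0, "B", "right", alpha=0.5, gamma=0.9)
```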

In truly Markov domains, Q-learning is usually the algorithm of choice; policy-sensitivity is often seen as a liability, because it makes issues of exploration more complicated. However, in non-Markov domains, policy-sensitivity is actually an asset. Because observations do not uniquely correspond to underlying states, the value of a policy depends on the distribution of underlying states given a particular observation. But this distribution generally depends on the policy. So, the value of a state, given a policy, can only be evaluated while executing that policy. In fact, Q-learning can be shown to fail to converge on very simple non-Markov domains. Note that, when SARSA is used in a non-Markovian environment, the symbols $x$ and $x'$ in equation (1) represent observations, which usually can correspond to several states.

The SARSA algorithm can be augmented with an eligibility trace, to yield the SARSA(λ) algorithm (a detailed exposition is given by Sutton and Barto). With the parameter λ set to 0, SARSA(λ) is just SARSA. With λ set to 1, it is a pure Monte Carlo method, in which, at the end of every trial, each state-action pair is adjusted toward the cumulative reward received on this trial after the state-action pair occurred. Pure Monte-Carlo algorithms make no attempt at satisfying Bellman equations relating the values of subsequent states; in partially observable domains, it is often impossible to satisfy the Bellman equation, making Monte-Carlo a reasonable choice. SARSA(λ) describes a useful class of algorithms, then, with appropriate choice of λ depending on the problem. Thus, SARSA(λ) with a large value of λ seems like the most appropriate of the conventional reinforcement-learning algorithms for solving partially-observable problems.
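The effect of λ can be illustrated with a small tabular sketch. It assumes accumulating eligibility traces and on-line updates, which is one common formulation and not necessarily the exact variant used in the experiments; the function name `sarsa_lambda_step` is hypothetical.

```python
from collections import defaultdict


def sarsa_lambda_step(Q, trace, x, a, r, x_next, a_next,
                      alpha, gamma, lam, done=False):
    """One on-line SARSA(lambda) step with accumulating traces (assumption)."""
    target = r if done else r + gamma * Q[(x_next, a_next)]
    delta = target - Q[(x, a)]
    trace[(x, a)] += 1.0  # mark the pair that was just executed
    for key in list(trace):
        Q[key] += alpha * delta * trace[key]  # credit all eligible pairs
        trace[key] *= gamma * lam             # decay eligibility
    return delta


def run_two_step_trial(lam):
    """Two-step trial: (A, a) with reward 0, then (B, b) with reward 1."""
    Q, trace = defaultdict(float), defaultdict(float)
    sarsa_lambda_step(Q, trace, "A", "a", 0.0, "B", "b", 0.5, 1.0, lam)
    sarsa_lambda_step(Q, trace, "B", "b", 1.0, None, None, 0.5, 1.0, lam,
                      done=True)
    return Q
```

In this toy trial, λ = 1 moves both pairs toward the final reward, while λ = 0 updates only the last pair, which is the distinction the paragraph above draws.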

2.2 VAPS 

Baird and Moore have derived, from first principles, a class of stochastic gradient-descent algorithms for reinforcement learning. At the most abstract level, we seek to minimize some measure of the expected cost of our policy; we can describe this high-level criterion as

$$B = \sum_{T=0}^{\infty} \sum_{s \in S_T} \Pr(s)\, \varepsilon(s) ,$$

where $S_T$ is the set of all possible experience sequences that terminate at time $T$. That is,

$$s = \{x_0, u_0, r_0, \ldots, x_t, u_t, r_t, \ldots, x_T, u_T, r_T\} ,$$

where $x_t$, $u_t$, and $r_t$ are the observation, action, and reward at step $t$ of the sequence, and $x_T$ is an observation associated with a terminal state. The loss incurred by a sequence $s$ is $\varepsilon(s)$. We restrict our attention to time-separable loss functions, which can be written as

$$\varepsilon(s) = \sum_{t=0}^{T} e(\mathrm{trunc}(s, t)) , \quad \text{for all } s \in S_T ,$$

where $e(s)$ is an instantaneous error function associated with each (finite) sequence prefix $s = \{x_0, u_0, r_0, \ldots, x_t, u_t, r_t\}$ ($x_t$ being any observation, not necessarily a terminal one), and $\mathrm{trunc}(s, t)$ represents the sequence $s$ truncated after time $t$. For instance, an error measure closely related to Q-learning is the squared Bellman residual:

$$e_{QL}(s) = \frac{1}{2} \sum_{x} \Pr(x_t = x \mid x_{t-1}, u_{t-1}) \left[ r_{t-1} + \max_{u} \gamma Q(x, u) - Q(x_{t-1}, u_{t-1}) \right]^2 .$$

The SARSA version of the algorithm uses the following error measure:

$$e_{SARSA}(s) = \frac{1}{2} \sum_{x} \Pr(x_t = x \mid x_{t-1}, u_{t-1}) \sum_{u} \Pr(u_t = u \mid x_t) \left[ r_{t-1} + \gamma Q(x, u) - Q(x_{t-1}, u_{t-1}) \right]^2 .$$

Note that we average over all possible actions $u_t$ according to their probability of being chosen by the policy instead of picking the one that maximizes Q-values as in $e_{QL}$. Baird and Moore also consider a kind of policy search, which is analogous to REINFORCE [27]:

$$e_{policy}(s) = b - \gamma^t r_t ,$$

where $b$ is any constant. This immediate error is summed over all time $t$, leading to a summation of all discounted immediate rewards $\gamma^t r_t$. In order to obtain the good properties of both criteria, they construct a final criterion that is a linear combination of the previous two:

$$e = (1 - \beta)\, e_{SARSA} + \beta\, e_{policy} .$$

This criterion combines Value And Policy Search and is, hence, called VAPS. We will refer to it as VAPS(β), for different values of β.

Baird and Moore show that the gradient of the global error with respect to weight $k$ can be written as:

$$\frac{\partial}{\partial w_k} B = \sum_{t=0}^{\infty} \sum_{s \in S_t} \Pr(s) \left[ \frac{\partial}{\partial w_k} e(s) + e(s) \sum_{j=1}^{t} \frac{\partial}{\partial w_k} \ln \Pr(u_{j-1} \mid x_{j-1}) \right] ,$$

where $S_t$ is the set of all experience prefixes of length $t$. Technically, it is necessary that $\Pr(u_t = u \mid x_t = x) > 0$ for all $(x, u)$ (otherwise, some zero-probability trajectories may have a non-zero contribution to the gradient of $B$). In this work we use the Boltzmann law for picking actions, which guarantees this property (see section 2.3).

It is possible to perform stochastic gradient descent of the "error" $B$, by repeating several trials of interaction with the process. Each experimental trial of length $T$ provides one sample of $s \in S_t$ for each $t < T$. Of course, these samples are not independent, but it does not matter since we are summing them and not multiplying them. We are thus using stochastic approximation to estimate the expectation over $s \in S_t$ in the above equation. During each trial, the weights are kept constant and the approximate gradients of the error at each time $t$,

$$\frac{\partial}{\partial w_k} e(s) + e(s) \sum_{j=1}^{t} \frac{\partial}{\partial w_k} \ln \Pr(u_{j-1} \mid x_{j-1}) ,$$

are accumulated. Weights are updated at the end of each trial, using the sum of these immediate gradients. An incremental implementation of the algorithm can be obtained by using, at every step $t$, the following update rules:

$$\Delta T_{k,t} = \frac{\partial}{\partial w_k} \ln \Pr(u_{t-1} \mid x_{t-1}) ,$$

$$\Delta w_k = -\alpha \left[ \frac{\partial}{\partial w_k} e(s_t) + e(s_t)\, T_{k,t} \right] ,$$

where $s_t$ represents the experience prefix $\{x_0, u_0, r_0, \ldots, x_t, u_t, r_t\}$, i.e., the history at time $t$. Note that the "exploration trace" $T_{k,t}$ is independent of the immediate error $e$ used. It only depends on the way the output $\Pr(u_t = u \mid x_t)$ varies with the weights $w_k$, i.e., on the representation chosen for the policy.

The gradient of the immediate error $e$ with respect to the weight $w_k$ is easy to calculate. For instance, in the case of the SARSA variant of the algorithm we have:

$$\frac{\partial}{\partial w_k} e_{SARSA}(s_t) = \sum_{x} \Pr(x_t = x \mid x_{t-1}, u_{t-1}) \sum_{u} \Pr(u_t = u \mid x_t) \left[ r_{t-1} + \gamma Q(x, u) - Q(x_{t-1}, u_{t-1}) \right] \left[ \gamma \frac{\partial}{\partial w_k} Q(x, u) - \frac{\partial}{\partial w_k} Q(x_{t-1}, u_{t-1}) \right] .$$

Once more, we descend this gradient by stochastic approximation: the averaging over $x_t$ and $u_t$ is replaced by a sampling of these quantities. However, since these variables appear twice in the equation and they are not just added, we have to sample both $x_t$ and $u_t$ independently two times in order to avoid any bias in the estimation of the gradient. It is not realistic to satisfy this requirement in a truly on-line situation, since the only way to get a new observation is by actually performing the action. Note that for the case β = 1 we do not need the second sample, so the VAPS(1) algorithm is effective in the on-line case.

In the case of policy search we have $\frac{\partial}{\partial w_k} e_{policy}(s) = 0$ for all $w_k$. This may seem strange; but for policy search, the important thing is the state occupations, which enter into the weight updates through the trace.
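As a reading aid, substituting this zero gradient and the definition of the policy-search error into the incremental weight update given earlier yields a compact form for the β = 1 case. This is only a direct substitution under the definitions already stated:

```latex
\Delta w_k
  = -\alpha \left[ 0 + e_{policy}(s_t)\, T_{k,t} \right]
  = -\alpha \left( b - \gamma^t r_t \right) T_{k,t}
  = \alpha \left( \gamma^t r_t - b \right) T_{k,t} .
```

With $b = 0$, a positive reward therefore moves each weight in the direction of its exploration trace, and a negative reward moves it in the opposite direction.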

2.3 VAPS(1)

In this section, we explore a special case of VAPS, in which the Q-values are stored in a look-up table. That is, there is one weight $w_k = Q(x, u)$ for each state-action pair. Note that it is not necessary to use the VAPS sequence-based gradient in a look-up table implementation of QL or SARSA, as long as it is confined to a Markovian environment. However, it makes sense to use it in the context of POMDPs. Under this hypothesis, the exploration trace $T_{k,t}$ associated with each parameter $Q(x, u)$ will be written $T_{x,u,t}$.

We will also focus on a very popular rule for randomly selecting actions as a function of their Q-value, namely the Boltzmann law:

$$\Pr(u_t = u \mid x_t = x) = \frac{e^{Q(x,u)/c}}{\sum_{u'} e^{Q(x,u')/c}} ,$$

where $c$ is a temperature parameter.¹ Under this rule we get:

$$\frac{\partial \ln \Pr(u_t = u \mid x_t = x)}{\partial Q(x', u')} = \begin{cases} 0 & \text{if } x' \neq x, \\ -\Pr(u_t = u' \mid x_t = x)/c & \text{if } x' = x \text{ and } u' \neq u, \\ \left[ 1 - \Pr(u_t = u \mid x_t = x) \right]/c & \text{if } x' = x \text{ and } u' = u. \end{cases}$$

In this case, and if we add the hypothesis that the problem is an achievement task, i.e., the reward is always 0 except when we reach an absorbing goal state, the exploration trace $T_{x,u,t}$ takes a very simple form:

$$T_{x,u,t} = \frac{N^t_{x,u} - N^t_x \Pr(u_t = u \mid x_t = x)}{c} = \frac{N^t_{x,u} - E[N^t_{x,u}]}{c} , \tag{2}$$

where $N^t_{x,u}$ is the number of times that action $u$ has been executed in state $x$ at time $t$, $N^t_x$ is the number of times that state $x$ has been visited at time $t$, and $E[N^t_{x,u}]$ represents the expected number of times we should have performed action $u$ in state $x$, knowing our exploration policy and our previous history.

As a result of equation (2), VAPS using $e_{policy}$ as immediate error, look-up tables and Boltzmann exploration reduces to a very simple algorithm. At each time-step where the current trial does not complete, we just increment the counter $N^t_{x,u}$ of the current state-action pair. When the trial completes, this trace is used to update all the Q-values, as described above.

It is interesting to try to understand the properties and implications of this simple rule. First, a direct consequence is that when something surprising happens, the algorithm adjusts the unlikely actions more than the likely ones. In other words, this simple procedure is very intuitive, since it assigns credit to state-action pairs proportional to the deviation from the expected behaviour. Note that SARSA(λ) is not capable of such a discrimination. This difference in behaviour is illustrated in the simulation results.

A second interesting property is that the Q-value updates tend to 0 as the length of the trial tends to infinity. This also makes sense, since the longer the trial, the less the final information received (the final reward) is relevant in evaluating each particular action. Alternatively, we could say that when too many actions have been performed, there is no reason to attribute the final result more to one of them than to others. Finally, unlike with Baird and Moore's version of the Boltzmann law, the sum of the updates to the Q-values on every step is 0. This makes it more likely that the weights will stay bounded.

¹ Note that Baird and Moore use an unusual version of the Boltzmann law, with $1 + e^{(\cdot)}$ in place of $e^{(\cdot)}$ in both the numerator and the denominator. We have found that it complicates the mathematics and worsens the performance, so we will use the standard Boltzmann law throughout.

3 Experiments



Domains We have experimented with SARSA and VAPS on five simple problems. Two are illustrative problems previously used in the reinforcement-learning literature; two others are instances of load-unload with different parameters; and the fifth is a variant of load-unload designed by us in an attempt to demonstrate a situation in which VAPS might outperform SARSA. The five problems are:

• Baird and Moore's problem, designed to illustrate the behavior of VAPS,

• McCallum's 11-state maze [12], which has only 6 observations,

• The load-unload problem, as described above, in which there are three locations (the loading location, the unloading location, and one intermediate one),

• A five-location load-unload problem (Figure 2), and

• A variant of the load-unload problem where a second loading location has been added, and the agent is punished instead of rewarded if it gets loaded at the wrong location. The state space is shown in Figure 3; states contained in a box are observationally indistinguishable to the agent. The idea here is that there is a single action that, if chosen, ruins the agent's long-term prospects. If this action is chosen due to exploration, then SARSA(λ) will punish all of the action choices along the chain but VAPS will punish only that action.

All these domains have a single starting state, except McCallum's problem, where the starting state is chosen uniformly at random.

Figure 3: The state-transition diagram of the load-unload problem with two loading locations; aliased states are grouped by dashed boxes.

Figure 4: Learning curves for VAPS and SARSA on the load-unload problem (one loading location). Plot title: Original Load-Unload Problem.

Algorithmic Details For each problem, we ran two algorithms: VAPS(1) and SARSA(1). The optimal policy for Baird's problem is memoryless, so the algorithms were applied directly in that case. For the other problems, we augmented the input space with an additional memory bit, and added two actions: one for setting the bit and one for clearing it.



The Q-functions were represented in a table, with one weight for each observation-action pair. The learning rate is determined by a parameter, $\alpha_0$; the actual learning rate is $\alpha_0$ plus an added term, depending on the trial number $N$, that decays to 0 over time. The temperature was also decayed in an ad hoc way, from $c_{max}$ down to $c_{min}$ with an increment of $1/(N-1)$ on each trial. In order to guarantee convergence of SARSA in MDPs, it is necessary to decay the temperature in a way that is dependent on the Q-values themselves [21]; in the POMDP setting it is much less clear what the correct decay strategy is. In any case, we have found that the empirical performance of the algorithm is not particularly sensitive to the temperature. The parameter $b$ in the immediate error $e_{policy}$ of VAPS was always set to 0.
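To make the tabular VAPS(1) procedure concrete, the end-of-trial update implied by equation (2) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the experiments: the function names are hypothetical, the counts include every action executed in the trial, and the discount exponent is taken to be the trial length.

```python
import math
from collections import defaultdict


def boltzmann(Q, x, actions, c):
    """Standard Boltzmann action probabilities at observation x."""
    m = max(Q[(x, u)] for u in actions)  # subtract max for numerical stability
    weights = [math.exp((Q[(x, u)] - m) / c) for u in actions]
    z = sum(weights)
    return {u: w / z for u, w in zip(actions, weights)}


def vaps1_trial_update(Q, trajectory, final_reward, actions,
                       alpha, gamma, c, b=0.0):
    """Update Q in place after one trial of an achievement task.

    trajectory: list of (observation, action) pairs executed in the trial.
    """
    n_xu, n_x = defaultdict(int), defaultdict(int)
    for x, u in trajectory:
        n_xu[(x, u)] += 1
        n_x[x] += 1
    # Immediate policy-search error at the end of the trial: b - gamma^T * r.
    error = b - gamma ** len(trajectory) * final_reward
    # Probabilities are computed with the weights held fixed during the trial.
    probs = {x: boltzmann(Q, x, actions, c) for x in n_x}
    for x in n_x:
        for u in actions:
            # Exploration trace: actual count minus expected count, over c.
            trace = (n_xu[(x, u)] - n_x[x] * probs[x][u]) / c
            Q[(x, u)] -= alpha * error * trace
```

For a one-step trial with two equally likely actions and a final reward of 1, the chosen action's value rises by the same amount that the other action's value falls, so the updates at that observation sum to zero.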

Experimental Protocol Each learning algorithm 
was executed for K runs; each run consisted of N tri- 
als, which began at the start state and executed until 
a terminal state was reached or M steps were taken. 
If the trial was terminated at M steps, it was given
a terminal reward of -1; M was chosen, in each case, 
to be 4 times the length of the optimal solution. At 
the beginning of each run, the weights were randomly 
reinitialized to small values. 

Results It was easy to make both algorithms work well on the first three problems: Baird's, McCallum's and small load-unload. The algorithms typically converged in fewer than 100 runs to an optimal policy. One thing to note here is that our version of VAPS, using the true Boltzmann exploration distribution rather than the one described by Baird and Moore, seems to perform significantly better than the original, according to results in their paper.

Figure 5: Learning curves for VAPS and SARSA on the load-unload problem with two loading locations. Plot title: Load-Unload Problem with Two Loading Locations.

Things were somewhat more complex with the last two 
problems (5 location load-unload with one or two load- 
ing locations). We experimented with parameters over 
a broad range and determined the following: 

• VAPS requires a value of β equal or very nearly equal to 1; these problems are highly non-Markovian, so the Bellman error is not at all useful as a criterion.

• For similar reasons, λ = 1 is best for SARSA(λ).

• Exploration was simplified by setting e to 0; empirically, $c_{max} = 1.0$ and $c_{min} = 0.2$ worked well for VAPS in both problems, and $c_{max} = 0.2$, $c_{min} = 0.1$ worked well for SARSA(λ).

• A base learning rate of $\alpha_0 = 0.5$ worked well for both algorithms in both domains.

Figures 4 and 5 show learning curves for both algorithms, averaged over 50 runs, on the load-unload
problem with one or two loading locations. Each run 
consisted of 1,000 trials. The vertical axis shows the 
number of steps required to reach the goal, with the 
terminated trials considered to have taken M steps. 

On the original load-unload problem, the algorithms perform essentially equivalently. Most runs of the algorithm converge to the optimal trial length of 9 and stay there; occasionally, however, it reaches 9 and then diverges. This can probably be avoided by decreasing the learning rate more steeply. When we add the second loading location, however, there is a significant difference. VAPS(1) consistently converges to a near-optimal policy, but SARSA(1) does not. The idea is that sometimes, even when the policy is pretty good, the agent is going to pick up the wrong load due to exploration and get punished for it. SARSA will punish all the state-action pairs equally; VAPS(1) will punish the bad state-action pair more due to the different principle of credit assignment.

4 Conclusions 

As Martin and Littman showed, small POMDPs can be solved effectively using stigmergic policies. Learning reactive policies in highly non-Markovian domains is not yet well-understood. We have seen that the VAPS algorithm, somewhat modified, can solve a collection of small POMDPs, and that although SARSA(λ) performs well on some POMDPs, it is possible to construct cases on which it fails. In a generalization of this work, we applied the VAPS algorithm to the problem of learning general finite-state controllers (which encompass external-memory policies) for POMDPs [10].